library(tidyverse)
library(janitor)
library(lubridate)
library(GGally)
library(ggfortify)
library(modelr)
library(tidytext)
library(wordcloud)
library(leaflet)
library(ggridges)
library(ggthemes)
library(relaimpo)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

Introduction

About the Data

The data is about New York City Airbnb listings in 2019.

The data includes information on, prices, New York neighbourhoods and room types, to name a few. Geospatial coordinates are also present, offering valuable insight into Airbnb locations.

Project Objectives

Primary aim: - Analyse the determinants of airbnb prices - Offer recommendations for customers booking airbnbs - Provide insights for hosts - adopt an appropriate pricing strategy

Overview

  • Ethical Considerations
  • Data Cleaning and Wrangling
  • Exploratory Analysis
  • Text mining
  • Geo-spatial Analysis
  • Model Building
    • Univariate Regression
    • Multivariate Regression
  • Analysis Conclusions
prices <- read_csv("raw_data/AB_NYC_2019.csv") %>% clean_names()

Ethical Considerations

The data contains information on host names and unique IDs. To avoid any ethical issues I chose to remove these variables.

Clean Data

Check Missing Values

Many missing values in last_review and reviews_per_month (10052 rows)

  • last_review will be dropped as inconsequential variable
  • values dropped from reviews_per_month
    • coalescing with mean may warp data too much

Data Cleaning

Exploratory Analysis

Total Number of Bookings per Borough

Manhattan and Brooklyn both by far the most popular areas for Airbnb listings.

Two central Boroughs which may indicate the main reason people book is for holidays / Tourism.

Average Price per Neighbourhood Group


# Average Price per Room
price_per_room <- prices_df %>% 
  # group by room_type
  group_by(room_type) %>% 
  # return average price per room_type
  summarise(avg_price = mean(price)) %>% 
  # create bar chart to visualise result
  ggplot(aes(room_type, avg_price, fill = room_type)) +
  # specify bar chart
  geom_col(show.legend = FALSE) +
  # annotate each bar
  geom_label(mapping = aes(label = round(avg_price, 2)), size = 6, 
             fill = "#F5FFFA", fontface = "bold", 
             # position change to make sure label stays on page
             position = position_stack(vjust = 0.9)) +
  # add theme
  theme_classic() +
  # titles
  labs(x = "Room Type", y = "Average Price ($)", 
       title = "Average Price by Room Type")


# Average price per neighbourhood group
price_room_borough <- prices_df %>% 
  # group by both neighbourhood and room_type
  group_by(neighbourhood_group, room_type) %>% 
  # return average price for each room type in each neighbourhood
  summarise(avg_price = mean(price)) %>% 
  # sort from highest to lowest price
  arrange(desc(avg_price)) %>% 
  # create bar chart
  ggplot(aes(reorder(neighbourhood_group, avg_price), avg_price, 
             fill = room_type)) +
  # specify bar chart
  geom_col(show.legend = FALSE) + 
  # annotate bars
  geom_label(mapping = aes(label = round(avg_price, 2)), size = 2.5, 
             fill = "#F5FFFA", fontface = "bold", hjust = 0.5, 
             # position change to make sure label stays on page
             position = position_stack(vjust = 0.9)) +
  # theme
  theme_classic() +
  # titles
  labs(x = "New York Boroughs", y = "Average Price ($)", 
       title = "Average Price per Borough by Room Type") + 
  # split into room_types
  facet_wrap(~room_type) +
  # flip x and y axis
  coord_flip()
`summarise()` has grouped output by 'neighbourhood_group'. You can override using the `.groups` argument.
# plot both visualisations together
cowplot::plot_grid(price_per_room, price_room_borough, nrow = 2)

The 10 most expensive and cheapest New York Districts

# Top 10 most expensive districts on average
a <- prices_df %>% 
  # group by individual districts within neighbourhoods
  group_by(neighbourhood) %>% 
  # return average price
  summarise(avg_price = mean(price)) %>% 
  # sort from highest to lowest
  arrange(desc(avg_price)) %>% 
  # return the top 10 (out of 218)
  slice(1:10) %>% 
  # create bar plot
  ggplot(aes(reorder(neighbourhood, avg_price), avg_price, 
             fill = neighbourhood)) + 
  # specify bar chart
  geom_col(show.legend = FALSE) +
  # annotate bars 
  geom_label(mapping = aes(label = round(avg_price, 2)), size = 3, 
             fill = "#F5FFFA", fontface = "bold") +
  # theme
  theme_classic() +
  # change x-axis label position
  theme(axis.text.x = element_text(angle = 30, vjust = 0.95, hjust = 1)) +
  # titles
  labs(x = "Neighbourhood", y = "Average Price ($)", 
       title = "The 10 Most Expensive Districts on Average")

# top 10 least expensive districts on average
b <- prices_df %>% 
  # group by individual districts within neighbourhoods
  group_by(neighbourhood) %>% 
  # return average prices
  summarise(avg_price = mean(price)) %>% 
  # arrange from lowest to highest
  arrange(avg_price) %>% 
  # return bottom 10 prices
  slice(1:10) %>% 
  # create bar chart
  ggplot(aes(reorder(neighbourhood, avg_price), avg_price, 
             fill = neighbourhood)) + 
  # specify bar chart
  geom_col(show.legend = FALSE) +
  # annotate bars
  geom_label(mapping = aes(label = round(avg_price, 2)), size = 3, 
             fill = "#F5FFFA", fontface = "bold") +
  # theme
  theme_classic() +
  # change x-axis label position
  theme(axis.text.x = element_text(angle = 30, vjust = 0.95, hjust = 1)) +
  # titles
  labs(x = "Neighbourhood", y = "Average Price ($)", 
       title = "The 10 Least Expensive Districts on Average")

# add plots into the same output
cowplot::plot_grid(a, b, nrow=2)

The 10 most expensive districts are all located in Manhattan apart from, Neponsit (Queens) and WillowBrook (Staten Island)

The majority of the 10 least expensive districts reside in the Bronx, State Island and Queens

Average Reviews by Last Month Review Submitted

Suggests that most people are leaving reviews in the summer, indicating some seasonality to Airbnb booking in New York.

Price Density by Area

# price density by New York Borough
ggplot(
  # create plot for price less than $500
  subset(prices_df, price < 500),aes(x = price)) +
  # specify density plot
  geom_density(
    mapping = aes(fill = neighbourhood_group), 
    bandwidth = 100, alpha = 1, size = 0.5, show.legend = FALSE) +
  # theme
  theme_bw() +
  # show individual density plots for boroughs
  facet_wrap(~neighbourhood_group) +
  # titles
  labs(x = "Price", y = "Density", title = "Price Density by Borough")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
Please use `linewidth` instead.Warning: Ignoring unknown parameters: `bandwidth`

Pricing density plot reveals that boroughs with fewer amount of bookings (Queens, Staten Island and Bronx) have a higher density of lower prices

Most common areas (Mahattan and Brooklyn) have a wider density plot indicating that prices vary more.

Text Mining

I want to find the words that are associated with different price ranges.

So I need to create new variables which classify the price range of each airbnb

we will define price ranges based around the average price for total bookings

mean price is $142 therefore, low will be less than $100, medium will be between $100 - $200, high will be $200 - $300 and very high will be greater than $300

High/Very High Price Range

# words associated with high prices
high_price_words <- prices_new %>% 
  # filter for High and Very High price classes 
  filter(price_class %in% c("High", "Very High")) %>% 
  # take individual words from the name column
  unnest_tokens(word, name) %>% 
  # take out stop_words 
  anti_join(stop_words) %>% 
  # select word column
  dplyr::select(word)
Joining, by = "word"
# most common words
sorted_hp_words <- high_price_words %>%
  # sort most common words
  count(word, sort = TRUE)

# create word cloud for high/very high price Airbnbs
wordcloud(
  # words to use for word cloud 
  words = sorted_hp_words$word,
  # number of times these words appear
  freq = sorted_hp_words$n, 
  # maximum number of words used
  max.words=60, 
  # no random order
  random.order=FALSE, 
  # proportion of words with a 90 degree angle
  rot.per=0.35, 
  # add colour palette
  colors=brewer.pal(8, "Spectral"))

Words that stand out from wordcloud include: bedroom, apartment, village, luxury, location, Manhattan, spacious, park - start to understand what kind of Airbnbs are being advertised for high prices

Low Price Range

# words associated with low prices
low_price_words <- prices_new %>% 
  # filter for rows associated with low prices
  filter(price_class %in% "Low") %>% 
  # take words from name column
  unnest_tokens(word, name) %>% 
  # remove stop words
  anti_join(stop_words) %>% 
  # select word column
  dplyr::select(word)
Joining, by = "word"
# most common words
sorted_lp_words <- low_price_words %>% 
  # sort most common words
  count(word, sort = TRUE)

wordcloud(
  # select words 
  words = sorted_lp_words$word, 
  # number of times these words appear
  freq = sorted_lp_words$n, 
  # maximum number of words in cloud
  max.words=60, 
  # no random order
  random.order=FALSE, 
  # proportion of words rotated
  rot.per=0.35, 
  # colour palette
  colors=brewer.pal(8, "Spectral"))

when we look at the word cloud of the low price range, we see some similarities with the high price range indicating owners are trying to sell the property as up market.

Big emphasis on property being “private” which is likely to be a big concern for people when not paying that much. In contrast, privacy is a given when paying for high end accommodation.

Medium Price Range

# words associated with medium price range (around the average)
medium_price_words <- prices_new %>% 
  # filter rows associated with medium price range
  filter(price_class %in% "Medium") %>% 
  # select words from name column
  unnest_tokens(word, name) %>% 
  # remove stop words
  anti_join(stop_words) %>% 
  # select words
  dplyr::select(word)
Joining, by = "word"
# most common words
sorted_mp_words <- medium_price_words %>% 
  # sort words
  count(word, sort = TRUE)


# word cloud for medium price range
wordcloud(
  # words to use
  words = sorted_mp_words$word, 
  # frequency words appear
  freq = sorted_mp_words$n, 
  # maximum number of words
  max.words=60, 
  # random order = FALSE --> makes it neat
  random.order=FALSE, 
  # proportion of words rotated
  rot.per=0.35, 
  # colour palette
  colors=brewer.pal(8, "Spectral"))

Not a tremendous amount of difference here, probably as expected it takes a balance between low and high price ranges highlighting privacy as important but also more emphasis on location.

Geo-spatial Analysis

Can we see if the geo-spatial backs up the word cloud and bigram analysis of emphasis on location for medium and low price ranges

Density of New York Boroughs

Density of Price Class

can see the spread of prices based on the area, can see that lower prices tend to located to the perimeter of New York, indicating that location towards centre of Manhattan is a large determinant of price.

Leaflet Map

Leaflet map shows the contrast between extremities and city centre, evidently much higher prices in the city.

Summary

Prices are being impacted by: - location - type of accommodation

Model Building - Linear Regression

Approach: - manual - good way to become familiar with data - avoid over or under fitting the model

Goals of the model:

Distribution of Price

Dependent variable: Price

Important to investigate its distribution as it may require a transformation to create a better fit for the model

Price is heavily skewed to the right indicates a log transformation is needed to get a normal bell shaped distribution

# mean price 
mean_price <- prices_new %>% 
  summarise(m_price = mean(price))


# log bell shaped distribution 
prices_new %>% 
  # set price on x-axis as a natural logarithm
  ggplot(aes(log(price))) +
  # specify histogram
  geom_histogram(bins = 40, aes(y = ..density..), fill = "red") + 
  # insert density area
  geom_density(alpha = 0.2, fill = "red") +
  # insert line to indicate average
  geom_vline(data = mean_price, aes(xintercept =  log(m_price)), size = 1, 
             linetype = 2) +
  # theme
  theme_bw() +
  # title
  labs(title = "Distribution of log(price)")

Finalising Data for Model

Dropped variables:

  • ‘name’ - use dummy variables for important words and bigrams instead
  • ‘neighbourhood’ - use the categorical variable for five New York Boroughs
  • ‘price_class’ - high correlated with dependent variable price

Dummy variables created from text analysis:

  • ‘apartment’
  • ‘private’
  • ‘central park’
prices_reg_df <- prices_new %>% 
  # create dummy variables for chosen words that may impact price
  mutate(apartment_ad = if_else(str_detect(name, "[Aa]partment"), "YES", "NO"),
         private_ad = if_else(str_detect(name, "[Pp]rivate"), "YES", "NO"),
         central_park_ad = if_else(str_detect(name, "[Cc]entral [Pp]ark"), 
                                   "YES", "NO"),
         ) %>% 
  # remove variables not considered for model
  dplyr::select(-c(name, neighbourhood, price_class)) %>% 
  # log transform price
  mutate(log_price = log(price + 1)) %>% 
  # bring price to the front
  dplyr::select(log_price,price, everything()) %>% 
  # change all character variables to factor
  mutate(across(.cols = is.character, 
                .fns = as_factor)) %>% 
  # remove original price variable
  dplyr::select(-price)
  

Check alias() function to check for multicollinearity

# check for multicollinearity
alias(lm(log_price ~ ., data = prices_reg_df))
Model :
log_price ~ neighbourhood_group + latitude + longitude + room_type + 
    minimum_nights + number_of_reviews + reviews_per_month + 
    calculated_host_listings_count + availability_365 + last_review_month + 
    name_length + apartment_ad + private_ad + central_park_ad

Data is now ready to be used in regression analysis

Use ggpairs() to investigate which variables are highly correlated with price.

Data set is large so relationship analysis will be divided into numeric and non-numeric datatypes

# select variables that are numeric
variable_numeric <- prices_reg_df %>%
  select_if(is.numeric)

# ggpairs with only numeric variables
ggpairs(variable_numeric)

# select variables that are categorical 
variable_nonnumeric <- prices_reg_df %>%
  # use function to return variables that are categorical
  select_if(function(x) !is.numeric(x))

# need to add log_price to non-numeric data 
variable_nonnumeric$log_price <- prices_reg_df$log_price

# non-numeric ggpairs
ggpairs(variable_nonnumeric)

Univariate Regression

Largest correlation with price: longitude - negatively correlated by 0.155 - statistically significant at the 0.001 level of significance.

The non-numeric ggpairs suggests that several variables will be able to explain price variance.

relationship between log(price) and longitude

create model

# linear regression model
model_1 <- lm(log_price ~ longitude, data = prices_reg_df)

Regression Diagnostics

graph 1 (population) - tell us most observations are independence, blue line declines at the end indicating other chunk of observations are not independently distributed. Need another variable.

graph 2 (distribution) - shows a fairly normal distribution, again, a bend at the end indicates model needs more to give a better distribution

graph 3 (homoskedasticity) - There is heteroskedasticity in the model indicated my curve on blue line

graph 4 (outliers) - There is a small number of outliers, and points are not highly leveraged.

model interpretation

# obtain results of regression model
summary(model_1)

Call:
lm(formula = log_price ~ longitude, data = prices_reg_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7215 -0.4287 -0.0309  0.3813  4.6629 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -335.83166    5.02723  -66.80   <2e-16 ***
longitude     -4.60491    0.06798  -67.74   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6256 on 38835 degrees of freedom
Multiple R-squared:  0.1057,    Adjusted R-squared:  0.1056 
F-statistic:  4589 on 1 and 38835 DF,  p-value: < 2.2e-16

coefficient:

  • An increase in longitude by one unit is associated with a change in price by 99.9% on average.

R-squared - longitude explains 10.6% of the variance of airbnb price.

Multivariate Regression

From the ggpairs plot and previous analysis, neighbourhood_group is suggests having an impact on airbnb prices, I will add this next.

Relationship between Price and Borough

regression diagnostics

graph 1 - independent population (achieved) graph 2 - still a skew in distribution graph 3 - Homoskdasticity (achieved) graph 4 - No highly leveraged points, but there are still (potentially) a few outliers

model interpretation

# regression results
summary(model_2)

Call:
lm(formula = log_price ~ longitude + neighbourhood_group, data = prices_reg_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5931 -0.4275 -0.0334  0.3609  4.6367 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      -3.283e+02  7.499e+00 -43.780  < 2e-16 ***
longitude                        -4.501e+00  1.014e-01 -44.391  < 2e-16 ***
neighbourhood_groupManhattan      2.790e-01  7.033e-03  39.667  < 2e-16 ***
neighbourhood_groupQueens         1.506e-01  1.294e-02  11.643  < 2e-16 ***
neighbourhood_groupStaten Island -9.425e-01  3.777e-02 -24.956  < 2e-16 ***
neighbourhood_groupBronx         -6.477e-02  2.201e-02  -2.942  0.00326 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6038 on 38831 degrees of freedom
Multiple R-squared:  0.167, Adjusted R-squared:  0.1669 
F-statistic:  1557 on 5 and 38831 DF,  p-value: < 2.2e-16

coefficients:

  • all statistically significant at levels of significance - can reject the null hypothesis and say that coefficients are statistically different from zero.

  • An increase in Manhattan by one unit (zero to one) is associated with a change in price by (e^0.27-1) * 100 = 31%, holding all other factors constant

  • An increase in Staten Island by one unit (zero to one) is associated with a change in price by (e^-0.94 - 1) * 100 = -60.9%, holding all other factors constant

Anova function

# function to check if dummy variables are useful
anova(model_1, model_2)
Analysis of Variance Table

Model 1: log_price ~ longitude
Model 2: log_price ~ longitude + neighbourhood_group
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1  38835 15198                                  
2  38831 14156  4    1042.4 714.85 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

anova function confirms the neighbourhood_group dummy variable was good to include in the model.

Model 3

before adding a new variable, must check the residuals of this model against the remaining variables in the dataset

check residuals

# non-numeric ggpairs
ggpairs(price_resid_nonnumeric)

From numeric variables, availability_365 has the highest correlation with price a positive correlation of 0.131 (statistically significant).

However, the non-numeric variables indicate that room type may present a better explanation of price variance.

Comparison between impact of Room Type and Availability on Price


avail_plot <- price_resid_numeric %>% 
  ggplot(aes(availability_365, resid)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "red") + 
  theme_bw() +
  labs(x = "Yearly Room Availability", y = "Residuals", 
       title = "Relationship Residuals and Availability") +
  theme(plot.title = element_text(size=10))

room_type_plot <- price_resid_nonnumeric %>% 
  ggplot(aes(room_type, resid, fill = room_type)) + 
  geom_boxplot(show.legend = FALSE) +
  theme_bw() + 
  labs(title = "Relationship Residuals and Room Type",
       x = "Room Type", y = "Residuals") +
  theme(plot.title = element_text(size=10))

cowplot::plot_grid(avail_plot, room_type_plot, nrow = 1) 
`geom_smooth()` using formula = 'y ~ x'

The comparison suggests room type should offer more explanation.

graph 1 - residuals are independent graph 2 - the residuals seem to be increasingly not distributed around zero with more skew at the beginning and end graph 3 - conditional variance of residuals is constant (homoskedasticity)

# model results
summary(model_3)

Call:
lm(formula = log_price ~ longitude + neighbourhood_group + room_type, 
    data = prices_reg_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9468 -0.2982 -0.0401  0.2313  4.9777 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      -2.133e+02  5.899e+00 -36.160  < 2e-16 ***
longitude                        -2.942e+00  7.978e-02 -36.873  < 2e-16 ***
neighbourhood_groupManhattan      2.397e-01  5.495e-03  43.611  < 2e-16 ***
neighbourhood_groupQueens         1.183e-01  1.010e-02  11.709  < 2e-16 ***
neighbourhood_groupStaten Island -6.894e-01  2.951e-02 -23.358  < 2e-16 ***
neighbourhood_groupBronx         -5.006e-02  1.718e-02  -2.914  0.00357 ** 
room_typeEntire home/apt          7.443e-01  4.943e-03 150.572  < 2e-16 ***
room_typeShared room             -3.956e-01  1.659e-02 -23.841  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4711 on 38829 degrees of freedom
Multiple R-squared:  0.493, Adjusted R-squared:  0.4929 
F-statistic:  5393 on 7 and 38829 DF,  p-value: < 2.2e-16

As anticipated, room_type greatly enhanced the explanation of the model. The R-squared suggests the model explains 49.3% of the variance in price.

The variable coefficients are all statistically significant therefore can be interpreted.

Model 4

Check residuals

As before, availability is displaying by far the strongest correlation, and there does not seem to be any stand out non-numeric variables. Therefore, room availability will be used in the next model.

graph 1 - model population is independently distributed graph 2 - still skew at both ends of graph, indicating graph is only somewhat normally distributed graph 3 - conditional variance of residuals is constant (homoskedastic)

# model results
summary(model_4)

Call:
lm(formula = log_price ~ longitude + neighbourhood_group + room_type + 
    availability_365, data = prices_reg_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9243 -0.2919 -0.0406  0.2328  5.0684 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      -2.274e+02  5.815e+00 -39.107  < 2e-16 ***
longitude                        -3.131e+00  7.863e-02 -39.819  < 2e-16 ***
neighbourhood_groupManhattan      2.332e-01  5.408e-03  43.122  < 2e-16 ***
neighbourhood_groupQueens         1.041e-01  9.943e-03  10.466  < 2e-16 ***
neighbourhood_groupStaten Island -7.852e-01  2.915e-02 -26.939  < 2e-16 ***
neighbourhood_groupBronx         -7.984e-02  1.692e-02  -4.719 2.38e-06 ***
room_typeEntire home/apt          7.438e-01  4.861e-03 153.004  < 2e-16 ***
room_typeShared room             -4.276e-01  1.634e-02 -26.166  < 2e-16 ***
availability_365                  6.674e-04  1.840e-05  36.271  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4633 on 38828 degrees of freedom
Multiple R-squared:  0.5096,    Adjusted R-squared:  0.5095 
F-statistic:  5043 on 8 and 38828 DF,  p-value: < 2.2e-16

R-squared has only increased marginally to 0.51 - model explains 51% of the variance in price

All variables in the model are statistically significant

Adding an interaction term

potential terms:

  • longitude:neighbourhood_group
  • longitude:room_type
  • longitude:availability_365
  • neighbourhood_group:room_type
  • neighbourhood_group:availability_365
  • room_type:availability_365

Through process of elimination, longitude:neighbourhood_group

Plotting the interaction

# add model residuals to data
price_resid <- prices_reg_df %>% 
  # add model residuals
  add_residuals(model_5) %>% 
  # remove log_price
  dplyr::select(-log_price)


# check the interaction between longitude and neighbourhood_group
coplot(resid ~ longitude | neighbourhood_group,
       # give an action to be carried out in each panel
       panel = function(x, y, ...){
         # plot coordinates of x and y
         points(x, y)
         # insert linear model line
         abline(lm(y ~ x), col = "blue")
       },
       data = price_resid, rows = 1)

Model Diagnostics

graph 1 - population is independent graph 2 - model still not completely evenly distributed around 0 graph 3 - conditional variance of residuals is constant (homoskedasticity) graph 4 - there are still some outliers, but no highly leveraged points

Model Summary

# model results
summary(model_5)

Call:
lm(formula = log_price ~ longitude + neighbourhood_group + room_type + 
    availability_365 + longitude:neighbourhood_group, data = prices_reg_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9155 -0.2856 -0.0436  0.2300  5.1577 

Coefficients:
                                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                -2.378e+02  1.031e+01 -23.065  < 2e-16 ***
longitude                                  -3.272e+00  1.394e-01 -23.468  < 2e-16 ***
neighbourhood_groupManhattan               -3.219e+02  1.557e+01 -20.677  < 2e-16 ***
neighbourhood_groupQueens                   1.761e+02  1.349e+01  13.053  < 2e-16 ***
neighbourhood_groupStaten Island            3.119e+02  5.582e+01   5.587 2.33e-08 ***
neighbourhood_groupBronx                    2.388e+02  3.601e+01   6.631 3.38e-11 ***
room_typeEntire home/apt                    7.255e-01  4.820e-03 150.506  < 2e-16 ***
room_typeShared room                       -4.263e-01  1.609e-02 -26.491  < 2e-16 ***
availability_365                            6.323e-04  1.815e-05  34.842  < 2e-16 ***
longitude:neighbourhood_groupManhattan     -4.354e+00  2.105e-01 -20.689  < 2e-16 ***
longitude:neighbourhood_groupQueens         2.382e+00  1.825e-01  13.053  < 2e-16 ***
longitude:neighbourhood_groupStaten Island  4.219e+00  7.533e-01   5.601 2.15e-08 ***
longitude:neighbourhood_groupBronx          3.233e+00  4.874e-01   6.633 3.32e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4562 on 38824 degrees of freedom
Multiple R-squared:  0.5246,    Adjusted R-squared:  0.5244 
F-statistic:  3570 on 12 and 38824 DF,  p-value: < 2.2e-16

Model Summary:

  • All explanatory variables are statistically significant

  • R-squared: model explains 52.5% of variance in price

    • model suffices as an ok explanation
  • Room_type had the greatest influence on the model

    • interpret the coefficient for room_type- entire/apt:
      • An increase in entire/apt by one unit (zero to one) is associated with a change in price by (e^0.73-1) * 100 = 107.5%, holding all other factors constant

Variable relative importance

# function to calculate variable importance
calc.relimp(model_5, type = "lmg", rela = TRUE)
Response variable: log_price 
Total response variance: 0.4375758 
Analysis based on 38837 observations 

12 Regressors: 
Some regressors combined in groups: 
        Group  neighbourhood_group : neighbourhood_groupManhattan neighbourhood_groupQueens neighbourhood_groupStaten Island neighbourhood_groupBronx 
        Group  room_type : room_typeEntire home/apt room_typeShared room 
        Group  longitude:neighbourhood_group : longitude:neighbourhood_groupManhattan longitude:neighbourhood_groupQueens longitude:neighbourhood_groupStaten Island longitude:neighbourhood_groupBronx 

 Relative importance of 5 (groups of) regressors assessed: 
 neighbourhood_group room_type longitude:neighbourhood_group longitude availability_365 
 
Proportion of variance explained by model: 52.46%
Metrics are normalized to sum to 100% (rela=TRUE). 

Relative importance metrics: 

                                     lmg
neighbourhood_group           0.15650767
room_type                     0.66297003
longitude:neighbourhood_group 0.03891208
longitude                     0.11782702
availability_365              0.02378320

Average coefficients for different model sizes: 

                                                  1group       2groups       3groups       4groups       5groups
longitude                                  -4.6049149739 -4.1229356833 -4.051358e+00 -3.897941e+00 -3.271661e+00
neighbourhood_groupManhattan                0.3819453426  0.3211757943 -1.070997e+02 -2.511174e+02 -3.218665e+02
neighbourhood_groupQueens                  -0.2087349654 -0.0655638368  8.477795e+01  1.718999e+02  1.760652e+02
neighbourhood_groupStaten Island           -0.2501789720 -0.4947127336  9.315515e+01  2.287431e+02  3.118768e+02
neighbourhood_groupBronx                   -0.3667919672 -0.2372578195  1.284316e+02  2.514425e+02  2.387853e+02
room_typeEntire home/apt                    0.8196414599  0.7858184226  7.602355e-01  7.344142e-01  7.254613e-01
room_typeShared room                       -0.3741098014 -0.3871140671 -4.059261e-01 -4.118401e-01 -4.263369e-01
availability_365                            0.0003947914  0.0005442186  6.232164e-04  6.210777e-04  6.323326e-04
longitude:neighbourhood_groupManhattan               NaN           NaN -5.805751e+00 -5.096923e+00 -4.354219e+00
longitude:neighbourhood_groupQueens                  NaN           NaN  4.586125e+00  3.487592e+00  2.381855e+00
longitude:neighbourhood_groupStaten Island           NaN           NaN  5.069942e+00  4.647986e+00  4.219434e+00
longitude:neighbourhood_groupBronx                   NaN           NaN  6.959079e+00  5.105896e+00  3.232856e+00

Variables of relative importance:

Room type, perhaps unsurprisingly, is the most relevant variable for explaining variations in price. The least explanatory is availability_365

Conclusion and Recommendations

For Consumers:

For hosts: - How you describe your property doesn’t translate to being able to charge higher prices - will be able to charge a much higher price for property listed as whole apartment - property availability does not have much impact on price

Future analysis

More data on:

-yearly data - time-series analysis - identify cyclical trends

---
title: "R Notebook"
output: html_notebook
---

```{r}
library(tidyverse)
library(janitor)
library(lubridate)
library(GGally)
library(ggfortify)
library(modelr)
library(tidytext)
library(wordcloud)
library(leaflet)
library(ggridges)
library(ggthemes)
library(relaimpo)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
```


# Introduction

## About the Data

The data is about New York City Airbnb listings in 2019. 

The data includes information on, prices, New York neighbourhoods and room types,
to name a few. Geospatial coordinates are also present, offering valuable insight
into Airbnb locations. 

## Project Objectives

Primary aim:
- Analyse the determinants of airbnb prices
  - Offer recommendations for customers booking airbnbs
  - Provide insights for hosts - adopt an appropriate pricing strategy
  
## Overview

- Ethical Considerations
- Data Cleaning and Wrangling
- Exploratory Analysis
- Text mining
- Geo-spatial Analysis
- Model Building
  - Univariate Regression
  - Multivariate Regression
- Analysis Conclusions


```{r}
prices <- read_csv("raw_data/AB_NYC_2019.csv") %>% clean_names()
```


# Ethical Considerations

The data contains information on host names and unique IDs. To avoid any ethical
issues I chose to remove these variables. 

- Reduce bias
- Follow law 
- ensure consumer trust

# Clean Data

## Check Missing Values

```{r}
# check for missing values
prices %>% 
  # return total missing values in each column
  summarise(
    across(
      .cols = everything(),
      .fns = ~sum(is.na(.x))
    )
  ) %>% 
  # select columns that have missing values
  dplyr::select(c(name, host_name, last_review, reviews_per_month))
```
Many missing values in last_review and reviews_per_month (10052 rows)

- last_review will be dropped as inconsequential variable
- values dropped from reviews_per_month
  - coalescing with mean may warp data too much 


## Data Cleaning

```{r}
prices_df <- prices %>% 
  # drop host names and host_id (ethical)
  dplyr::select(-c(host_name, host_id, id)) %>% 
  # take out month from last_review 
  mutate(last_review_month = month(last_review, label = TRUE),
         # take name length
         name_length = str_length(name),
         # impute missing values with average
         reviews_per_month = coalesce(reviews_per_month, 
                                      mean(reviews_per_month, na.rm = TRUE))
         ) %>% 
  # remove last_review column
  dplyr::select(-last_review)

# check for missing values
prices_df %>% 
  # return total missing values in each column
  summarise(
    across(
      .cols = everything(),
      .fns = ~sum(is.na(.x))
    )
  ) %>% 
  # columns that returned missing values
  dplyr::select(c(name, last_review_month, name_length))


# drop missing values
prices_df <- prices_df %>% 
  drop_na()



```


# Exploratory Analysis

## Total Number of Bookings per Borough

```{r}
# number of bookings per borough
prices_df %>% 
  # group by neighbourhood
  group_by(neighbourhood_group) %>% 
  # return total bookings for each neighbourhood
  summarise(num_bookings = n()) %>% 
  # arrange from highest to lowest
  arrange(desc(num_bookings)) %>% 
  # create a bar chart to visualise result 
  ggplot(aes(reorder(neighbourhood_group, num_bookings), num_bookings, 
             fill = neighbourhood_group)) +
  # specify bar chart
  geom_col(show.legend = FALSE) +
  # annotate each bar
  geom_label(mapping = aes(label = num_bookings), size = 3, 
             fill = "#F5FFFA", fontface = "bold", hjust = 0.5) +
  # add theme
  theme_classic() +
  # titles
  labs(x = "Borough", y = "Number of Bookings", 
       title = "Total Number of Bookings per Borough")
```
Manhattan and Brooklyn both by far the most popular areas for Airbnb listings.

Two central Boroughs which may indicate the main reason people book is for 
holidays / Tourism. 


## Average Price per Neighbourhood Group

```{r}

# Average Price per Room
price_per_room <- prices_df %>% 
  # group by room_type
  group_by(room_type) %>% 
  # return average price per room_type
  summarise(avg_price = mean(price)) %>% 
  # create bar chart to visualise result
  ggplot(aes(room_type, avg_price, fill = room_type)) +
  # specify bar chart
  geom_col(show.legend = FALSE) +
  # annotate each bar
  geom_label(mapping = aes(label = round(avg_price, 2)), size = 6, 
             fill = "#F5FFFA", fontface = "bold", 
             # position change to make sure label stays on page
             position = position_stack(vjust = 0.9)) +
  # add theme
  theme_classic() +
  # titles
  labs(x = "Room Type", y = "Average Price ($)", 
       title = "Average Price by Room Type")


# Average price per neighbourhood group
price_room_borough <- prices_df %>% 
  # group by both neighbourhood and room_type
  group_by(neighbourhood_group, room_type) %>% 
  # return average price for each room type in each neighbourhood
  summarise(avg_price = mean(price)) %>% 
  # sort from highest to lowest price
  arrange(desc(avg_price)) %>% 
  # create bar chart
  ggplot(aes(reorder(neighbourhood_group, avg_price), avg_price, 
             fill = room_type)) +
  # specify bar chart
  geom_col(show.legend = FALSE) + 
  # annotate bars
  geom_label(mapping = aes(label = round(avg_price, 2)), size = 2.5, 
             fill = "#F5FFFA", fontface = "bold", hjust = 0.5, 
             # position change to make sure label stays on page
             position = position_stack(vjust = 0.9)) +
  # theme
  theme_classic() +
  # titles
  labs(x = "New York Boroughs", y = "Average Price ($)", 
       title = "Average Price per Borough by Room Type") + 
  # split into room_types
  facet_wrap(~room_type) +
  # flip x and y axis
  coord_flip()

# plot both visualisations together
cowplot::plot_grid(price_per_room, price_room_borough, nrow = 2)
```


## The 10 most expensive and cheapest New York Districts

```{r}
# Top 10 most expensive districts on average
a <- prices_df %>% 
  # group by individual districts within neighbourhoods
  group_by(neighbourhood) %>% 
  # return average price
  summarise(avg_price = mean(price)) %>% 
  # sort from highest to lowest
  arrange(desc(avg_price)) %>% 
  # return the top 10 (out of 218)
  slice(1:10) %>% 
  # create bar plot
  ggplot(aes(reorder(neighbourhood, avg_price), avg_price, 
             fill = neighbourhood)) + 
  # specify bar chart
  geom_col(show.legend = FALSE) +
  # annotate bars 
  geom_label(mapping = aes(label = round(avg_price, 2)), size = 3, 
             fill = "#F5FFFA", fontface = "bold") +
  # theme
  theme_classic() +
  # change x-axis label position
  theme(axis.text.x = element_text(angle = 30, vjust = 0.95, hjust = 1)) +
  # titles
  labs(x = "Neighbourhood", y = "Average Price ($)", 
       title = "The 10 Most Expensive Districts on Average")

# top 10 least expensive districts on average
b <- prices_df %>% 
  # group by individual districts within neighbourhoods
  group_by(neighbourhood) %>% 
  # return average prices
  summarise(avg_price = mean(price)) %>% 
  # arrange from lowest to highest
  arrange(avg_price) %>% 
  # return bottom 10 prices
  slice(1:10) %>% 
  # create bar chart
  ggplot(aes(reorder(neighbourhood, avg_price), avg_price, 
             fill = neighbourhood)) + 
  # specify bar chart
  geom_col(show.legend = FALSE) +
  # annotate bars
  geom_label(mapping = aes(label = round(avg_price, 2)), size = 3, 
             fill = "#F5FFFA", fontface = "bold") +
  # theme
  theme_classic() +
  # change x-axis label position
  theme(axis.text.x = element_text(angle = 30, vjust = 0.95, hjust = 1)) +
  # titles
  labs(x = "Neighbourhood", y = "Average Price ($)", 
       title = "The 10 Least Expensive Districts on Average")

# add plots into the same output
cowplot::plot_grid(a, b, nrow=2)
```

The 10 most expensive districts are all located in Manhattan apart from, 
Neponsit (Queens) and WillowBrook (Staten Island)

The majority of the 10 least expensive districts reside in the Bronx, State 
Island and Queens

## Average Reviews by Last Month Review Submitted

```{r}
# Average Reviews per Month by Last Month review was left
prices_df %>% 
  # change last_review_month to a factor 
  mutate(last_review_month = as_factor(last_review_month)) %>% 
  # group by last_review_month
  group_by(last_review_month) %>% 
  # return the average_reviews_per_month 
  summarise(n = mean(reviews_per_month)) %>% 
  # create a bar graph
  ggplot(aes(last_review_month, n, fill = last_review_month)) + 
  # specify bar graph
  geom_col(show.legend = FALSE) +
  # theme
  theme_bw() +
  # titles
  labs(x = "Last Review Month", y = "Average Reviews per Month", 
       title = "Average Reviews by Last Month Review Submitted")
```

Suggests that most people are leaving reviews in the summer, indicating some
seasonality to Airbnb booking in New York. 


# Price Density by Area

```{r}
# price density by New York Borough
ggplot(
  # create plot for price less than $500
  subset(prices_df, price < 500),aes(x = price)) +
  # specify density plot
  geom_density(
    mapping = aes(fill = neighbourhood_group), 
    bandwidth = 100, alpha = 1, size = 0.5, show.legend = FALSE) +
  # theme
  theme_bw() +
  # show individual density plots for boroughs
  facet_wrap(~neighbourhood_group) +
  # titles
  labs(x = "Price", y = "Density", title = "Price Density by Borough")
```

Pricing density plot reveals that boroughs with fewer amount of bookings (Queens,
Staten Island and Bronx) have a higher density of lower prices 

Most common areas (Mahattan and Brooklyn) have a wider density plot indicating 
that prices vary more. 


# Text Mining

I want to find the words that are associated with different price ranges. 

So I need to create new variables which classify the price range of each
airbnb

we will define price ranges based around the average price for total bookings

```{r}
# return mean_price of Airbnbs
prices_df %>% 
  summarise(mean_price = mean(price))
```
mean price is $142 therefore, low will be less than $100, medium will be 
between $100 - $200, high will be $200 - $300 and very high will be greater than
$300

```{r}
# create new categorical variable for price
prices_new <- prices_df %>% 
  mutate(price_class = case_when(
    price < 100 ~ "Low", 
    price < 200 ~ "Medium", 
    price < 300 ~ "High",
    TRUE ~ "Very High"
  ))
  
```



## High/Very High Price Range


```{r}
# words associated with high prices
high_price_words <- prices_new %>% 
  # filter for High and Very High price classes 
  filter(price_class %in% c("High", "Very High")) %>% 
  # take individual words from the name column
  unnest_tokens(word, name) %>% 
  # take out stop_words 
  anti_join(stop_words) %>% 
  # select word column
  dplyr::select(word)

# most common words
sorted_hp_words <- high_price_words %>%
  # sort most common words
  count(word, sort = TRUE)

# create word cloud for high/very high price Airbnbs
wordcloud(
  # words to use for word cloud 
  words = sorted_hp_words$word,
  # number of times these words appear
  freq = sorted_hp_words$n, 
  # maximum number of words used
  max.words=60, 
  # no random order
  random.order=FALSE, 
  # proportion of words with a 90 degree angle
  rot.per=0.35, 
  # add colour palette
  colors=brewer.pal(8, "Spectral"))

```


Words that stand out from wordcloud include: bedroom, apartment, village, luxury, 
location, Manhattan, spacious, park
- start to understand what kind of Airbnbs are being advertised for high prices


## Low Price Range


```{r}
# words associated with low prices
low_price_words <- prices_new %>% 
  # filter for rows associated with low prices
  filter(price_class %in% "Low") %>% 
  # take words from name column
  unnest_tokens(word, name) %>% 
  # remove stop words
  anti_join(stop_words) %>% 
  # select word column
  dplyr::select(word)

# most common words
sorted_lp_words <- low_price_words %>% 
  # sort most common words
  count(word, sort = TRUE)

wordcloud(
  # select words 
  words = sorted_lp_words$word, 
  # number of times these words appear
  freq = sorted_lp_words$n, 
  # maximum number of words in cloud
  max.words=60, 
  # no random order
  random.order=FALSE, 
  # proportion of words rotated
  rot.per=0.35, 
  # colour palette
  colors=brewer.pal(8, "Spectral"))
```

when we look at the word cloud of the low price range, we see some similarities
with the high price range indicating owners are trying to sell the property as
up market. 

Big emphasis on property being "private" which is likely to be a big concern
for people when not paying that much. In contrast, privacy is a given when paying
for high end accommodation. 



## Medium Price Range

```{r}
# words associated with medium price range (around the average)
medium_price_words <- prices_new %>% 
  # filter rows associated with medium price range
  filter(price_class %in% "Medium") %>% 
  # select words from name column
  unnest_tokens(word, name) %>% 
  # remove stop words
  anti_join(stop_words) %>% 
  # select words
  dplyr::select(word)


# most common words
sorted_mp_words <- medium_price_words %>% 
  # sort words
  count(word, sort = TRUE)


# word cloud for medium price range
wordcloud(
  # words to use
  words = sorted_mp_words$word, 
  # frequency words appear
  freq = sorted_mp_words$n, 
  # maximum number of words
  max.words=60, 
  # random order = FALSE --> makes it neat
  random.order=FALSE, 
  # proportion of words rotated
  rot.per=0.35, 
  # colour palette
  colors=brewer.pal(8, "Spectral"))
```

Not a tremendous amount of difference here, probably as expected it takes a 
balance between low and high price ranges highlighting privacy as important but 
also more emphasis on location. 


# Most Popular Bigrams used in Airbnb Name

Analysis of the most popular word combinations

```{r}
# Obtain the 2 word bigrams
bigram_names <- prices_new %>% 
  unnest_tokens(bigram, name, token = "ngrams", n = 2)

# Separate into two words
bigrams_separated <- bigram_names %>% 
  separate(bigram, c("word1", "word2"), sep = " ")

# filter the words for stop words
bigrams_filtered <- bigrams_separated %>% 
  # filter words NOT IN stop words
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word)

# count and sort the most popular words
bigrams_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

# most popular bigrams graph
bigrams_counts %>% 
  # filter out numbers
  filter(!str_detect(word1, "[0-9]")) %>% 
  # unite both word columns
  unite(col = "bigrams", c("word1", "word2"), sep = " ", remove = TRUE) %>% 
  # get top 10
  slice(1:10) %>% 
  # visualise
  ggplot(aes(reorder(bigrams, n), n, fill = bigrams)) + 
  # specify bar graph
  geom_col(show.legend = FALSE) +
  # flip x and y-axis
  coord_flip() +
  # theme
  theme_bw() + 
  # annotate each bar
  geom_label(mapping = aes(label = n), size = 3, 
             fill = "#F5FFFA", fontface = "bold") +
  # titles
  labs(y = "Number of Bigram Mentions", x = "Bigram", 
       title = "Most Popular Bigrams in Airbnb Name")

```

Hosts emphasis on location is a common strategy to entice consumers

consumers want to be closer to famous landmarks. 


# Geo-spatial Analysis


Can we see if the geo-spatial backs up the word cloud and bigram analysis of 
emphasis on location for medium and low price ranges

## Density of New York Boroughs


```{r}
# Density of New York Boroughs
prices_new %>% 
  # plot longitude and latitude
  ggplot(aes(longitude, latitude)) +
  # specify density plot - has to be geom_density2d
  # specify density by neighbourhoods
  geom_density2d(aes(colour = neighbourhood_group)) + 
  # theme
  theme_bw() +
  # title
  labs(title = "Density of Boroughs")
```

## Density of Price Class

```{r}
# Density of Price Class
prices_new %>% 
  # plot longitude and latitude
  ggplot(aes(longitude, latitude)) +
  # specify density by price_class
  geom_density2d(aes(colour = price_class)) + 
  # theme
  theme_bw() +
  # titles
  labs(title = "Density of Price Class")

```


can see the spread of prices based on the area, can see that lower prices 
tend to located to the perimeter of New York, indicating that location towards
centre of Manhattan is a large determinant of price.


## Leaflet Map

```{r}
# define colour palette for map
 pal <- colorFactor(
   # select own colour scheme
   palette = c("orange", "white", "yellow", "red"), 
   # select variable for colour palette
   domain = prices_df$price_class)

# leaflet map
leaflet(data = prices_new) %>% 
  # add dark background for map
  addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>% 
  # add marker for each Airbnb based on price_class
  addCircleMarkers(~longitude, ~latitude, color = ~pal(price_class), weight = 1, 
                   radius=1, fillOpacity = 0.1, opacity = 0.1) %>%
  # add legend for colour associated with price_class
  addLegend("bottomright", pal = pal, values = ~price_class,
            title = "Price Class",
            opacity = 1
  ) 
```


Leaflet map shows the contrast between extremities and city centre, evidently
much higher prices in the city. 

__Summary__

Prices are being impacted by:
- location
- type of accommodation 


# Model Building - Linear Regression

Approach: 
- manual
- good way to become familiar with data
- avoid over or under fitting the model

Goals of the model: 

- A well fit model that reasonably explains variance in price
- satisfies assumptions needed to validate model
- Be able to output statistically significant coefficients for consumer/host
recommendations


## Distribution of Price

Dependent variable: Price

Important to investigate its distribution as it may require a transformation to 
create a better fit for the model


```{r}
# distribution of price 
prices_new %>% 
  # select price for x-axis
  ggplot(aes(price)) +
  # specify histogram
  geom_histogram(bins = 100, aes(y = ..density..), fill = "red") + 
  # add in a density plot
  geom_density(alpha = 0.2, fill = "red") +
  # theme
  theme_bw() +
  # title
  labs(title = "Distribution of Price")
```

Price is heavily skewed to the right indicates a log transformation is needed
to get a normal bell shaped distribution

```{r}
# mean price 
mean_price <- prices_new %>% 
  summarise(m_price = mean(price))


# log bell shaped distribution 
prices_new %>% 
  # set price on x-axis as a natural logarithm
  ggplot(aes(log(price))) +
  # specify histogram
  geom_histogram(bins = 40, aes(y = ..density..), fill = "red") + 
  # insert density area
  geom_density(alpha = 0.2, fill = "red") +
  # insert line to indicate average
  geom_vline(data = mean_price, aes(xintercept =  log(m_price)), size = 1, 
             linetype = 2) +
  # theme
  theme_bw() +
  # title
  labs(title = "Distribution of log(price)")
```



## Finalising Data for Model

Dropped variables:

- 'name' - use dummy variables for important words and bigrams instead
- 'neighbourhood' - use the categorical variable for five New York Boroughs
- 'price_class' - high correlated with dependent variable price

Dummy variables created from text analysis: 

- 'apartment'
- 'private'
- 'central park'

```{r}
prices_reg_df <- prices_new %>% 
  # create dummy variables for chosen words that may impact price
  mutate(apartment_ad = if_else(str_detect(name, "[Aa]partment"), "YES", "NO"),
         private_ad = if_else(str_detect(name, "[Pp]rivate"), "YES", "NO"),
         central_park_ad = if_else(str_detect(name, "[Cc]entral [Pp]ark"), 
                                   "YES", "NO"),
         ) %>% 
  # remove variables not considered for model
  dplyr::select(-c(name, neighbourhood, price_class)) %>% 
  # log transform price
  mutate(log_price = log(price + 1)) %>% 
  # bring price to the front
  dplyr::select(log_price,price, everything()) %>% 
  # change all character variables to factor
  mutate(across(.cols = is.character, 
                .fns = as_factor)) %>% 
  # remove original price variable
  dplyr::select(-price)
  
```


Check alias() function to check for multicollinearity 

```{r}
# check for multicollinearity
alias(lm(log_price ~ ., data = prices_reg_df))
```

Data is now ready to be used in regression analysis



Use ggpairs() to investigate which variables are highly correlated with price.

Data set is large so relationship analysis will be divided into numeric and 
non-numeric datatypes

```{r message=FALSE}
# select variables that are numeric
variable_numeric <- prices_reg_df %>%
  select_if(is.numeric)

# ggpairs with only numeric variables
ggpairs(variable_numeric)
```


```{r message=FALSE}
# select variables that are categorical 
variable_nonnumeric <- prices_reg_df %>%
  # use function to return variables that are categorical
  select_if(function(x) !is.numeric(x))

# need to add log_price to non-numeric data 
variable_nonnumeric$log_price <- prices_reg_df$log_price

# non-numeric ggpairs
ggpairs(variable_nonnumeric)
```


## Univariate Regression

Largest correlation with price: longitude
- negatively correlated by 0.155
- statistically significant at the 0.001 level of significance. 

The non-numeric ggpairs suggests that several variables will be able to explain 
price variance.  

### relationship between log(price) and longitude

```{r}
# scatter plot for log_price ~ longitude
prices_reg_df %>% 
  ggplot(aes(longitude, log_price)) + 
  # specify scatter plot
  geom_point() + 
  # add in linear model line of best fit
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  # theme
  theme_bw() + 
  labs(title = "Relationship between log(price) and longitude")
```
create model

```{r}
# linear regression model
model_1 <- lm(log_price ~ longitude, data = prices_reg_df)
```


```{r}
# check regression diagnostics
autoplot(model_1)
```


### Regression Diagnostics 

graph 1 (population) - tell us most observations are independence, blue line
declines at the end indicating other chunk of observations are not independently
distributed. Need another variable.

graph 2 (distribution) - shows a fairly normal distribution, again, a bend at the
end indicates model needs more to give a better distribution

graph 3 (homoskedasticity) - There is heteroskedasticity in the model 
indicated my curve on blue line 

graph 4 (outliers) - There is a small number of outliers, and points are not
highly leveraged. 


### model interpretation

```{r}
# obtain results of regression model
summary(model_1)
```

__coefficient: __

- An increase in longitude by one unit is associated with a change in price by 
__99.9%__ on average. 

R-squared - longitude explains 10.6% of the variance of airbnb price. 



## Multivariate Regression

From the ggpairs plot and previous analysis, neighbourhood_group is suggests 
having an impact on airbnb prices, I will add this next. 



## Relationship between Price and Borough

```{r}
# relationship log(price) ~ borough
prices_reg_df %>% 
  ggplot(aes(neighbourhood_group, log_price, fill = neighbourhood_group)) + 
  # specify boxplot
  geom_boxplot(show.legend = FALSE) +
  # theme
  theme_bw() + 
  # title
  labs(title = "Relationship between Price and Borough")
```

```{r}
# linear model 2
model_2 <- lm(log_price ~ longitude + neighbourhood_group, data = prices_reg_df)
```


```{r}
# regression diagnostics
autoplot(model_2)
```


### regression diagnostics

graph 1 - independent population (achieved)
graph 2 - still a skew in distribution
graph 3 - Homoskdasticity (achieved)
graph 4 - No highly leveraged points, but there are still (potentially) a few 
outliers


### model interpretation

```{r}
# regression results
summary(model_2)
```

__coefficients:__ 

- all statistically significant at levels of significance - can reject the null
hypothesis and say that coefficients are statistically different from zero. 

-  An increase in Manhattan by one unit (zero to one) is associated with a 
change in price by (e^0.27-1) * 100 = __31%__, holding all other factors 
constant 

- An increase in Staten Island by one unit (zero to one) is associated with a 
change in price by (e^-0.94 - 1) * 100 = __-60.9%__, holding all other factors 
constant


### Anova function

```{r}
# function to check if dummy variables are useful
anova(model_1, model_2)
```

anova function confirms the neighbourhood_group dummy variable was good to include
in the model. 


## Model 3 

before adding a new variable, must check the residuals of this model against the
remaining variables in the dataset


### check residuals


```{r message=FALSE}

# obtain residuals of model 2
residuals <- prices_reg_df %>% 
  # add in residuals to dataset
  add_residuals(model_2) %>% 
  # remove variables already in the model
  dplyr::select(-c(log_price, longitude, neighbourhood_group))
```


```{r message = FALSE}
# seperate into numeric and non-numeric 

# numeric data
price_resid_numeric <- residuals %>%
  select_if(is.numeric)

# non-numeric data
price_resid_nonnumeric <- residuals %>%
  select_if(function(x) !is.numeric(x))

# add residudals to non-numeric data
price_resid_nonnumeric$resid <- residuals$resid

# now ready for comparing model residuals to rest of data

# numeric ggpairs
ggpairs(price_resid_numeric)
```


```{r message=FALSE}
# non-numeric ggpairs
ggpairs(price_resid_nonnumeric)
```


From numeric variables, availability_365 has the highest correlation with price
a positive correlation of 0.131 (statistically significant). 

However, the non-numeric variables indicate that room type may present a better
explanation of price variance. 


### Comparison between impact of Room Type and Availability on Price

```{r}
# relationship residuals ~ room availability
avail_plot <- price_resid_numeric %>% 
  ggplot(aes(availability_365, resid)) + 
  # scatter plot 
  geom_point() +
  # insert linear model line of best fit
  geom_smooth(method = "lm", se = FALSE, colour = "red") + 
  # theme
  theme_bw() +
  # titles
  labs(x = "Yearly Room Availability", y = "Residuals", 
       title = "Relationship Residuals and Availability") +
  # adjust title font size
  theme(plot.title = element_text(size=10))


# relationship residuals ~ room_type
room_type_plot <- price_resid_nonnumeric %>% 
  ggplot(aes(room_type, resid, fill = room_type)) + 
  # boxplot
  geom_boxplot(show.legend = FALSE) +
  # theme
  theme_bw() + 
  # titles 
  labs(title = "Relationship Residuals and Room Type",
       x = "Room Type", y = "Residuals") +
  # adjust title font size
  theme(plot.title = element_text(size=10))


# plot both visuals together
cowplot::plot_grid(avail_plot, room_type_plot, nrow = 1) 
```

The comparison suggests room type should offer more explanation. 


```{r}
# linear model 3
model_3 <- lm(log_price ~ longitude + neighbourhood_group + room_type, 
              data = prices_reg_df)
```


```{r}
# check model diagnostics
autoplot(model_3)
```


graph 1 - residuals are independent
graph 2 - the residuals seem to be increasingly not distributed around zero
with more skew at the beginning and end
graph 3 - conditional variance of residuals is constant (homoskedasticity)


```{r}
# model results
summary(model_3)
```


As anticipated, room_type greatly enhanced the explanation of the model. The 
R-squared suggests the model explains 49.3% of the variance in price. 

The variable coefficients are all statistically significant therefore can be
interpreted. 


## Model 4

### Check residuals


```{r message=FALSE}
# compare model residuals to remaining data
residuals <- prices_reg_df %>% 
  # add residuals
  add_residuals(model_3) %>% 
  # remove variables in model
  dplyr::select(-c(log_price, longitude, neighbourhood_group, room_type))
```


```{r message = FALSE}

# seperate into numeric and non-numeric 

# numeric data
price_resid_numeric <- residuals %>%
  select_if(is.numeric)

# non-numeric data
price_resid_nonnumeric <- residuals %>%
  select_if(function(x) !is.numeric(x))


# add residuals to non-numeric data
price_resid_nonnumeric$resid <- residuals$resid

# numeric ggpairs
ggpairs(price_resid_numeric)
```


```{r message = FALSE}
# non-numeric ggpairs
ggpairs(price_resid_nonnumeric)
```


As before, availability is displaying by far the strongest correlation, and there
does not seem to be any stand out non-numeric variables. Therefore, room 
availability will be used in the next model. 


```{r}
# linear model 4
model_4 <- lm(log_price ~ longitude + neighbourhood_group + room_type + 
                availability_365, data = prices_reg_df)
```



```{r}
# check model diagnostics
autoplot(model_4)
```

graph 1 - model population is independently distributed
graph 2 - still skew at both ends of graph, indicating graph is only somewhat
normally distributed
graph 3 - conditional variance of residuals is constant (homoskedastic)



```{r}
# model results
summary(model_4)
```


R-squared has only increased marginally to 0.51 - model explains 51% of the 
variance in price

All variables in the model are statistically significant



## Adding an interaction term

potential terms:

- longitude:neighbourhood_group
- longitude:room_type
- longitude:availability_365
- neighbourhood_group:room_type
- neighbourhood_group:availability_365
- room_type:availability_365


Through process of elimination, longitude:neighbourhood_group 

```{r}
# linear model 5 - with interaction term
model_5 <- lm(log_price ~ longitude + neighbourhood_group + room_type + 
                availability_365 + longitude:neighbourhood_group,
              data = prices_reg_df)
```


### Plotting the interaction 

```{r}
# add model residuals to data
price_resid <- prices_reg_df %>% 
  # add model residuals
  add_residuals(model_5) %>% 
  # remove log_price
  dplyr::select(-log_price)


# check the interaction between longitude and neighbourhood_group
coplot(resid ~ longitude | neighbourhood_group,
       # give an action to be carried out in each panel
       panel = function(x, y, ...){
         # plot coordinates of x and y
         points(x, y)
         # insert linear model line
         abline(lm(y ~ x), col = "blue")
       },
       data = price_resid, rows = 1)
```


### Model Diagnostics

```{r}
# regression diagnostics
autoplot(model_5)
```

graph 1 - population is independent
graph 2 - model still not completely evenly distributed around 0
graph 3 - conditional variance of residuals is constant (homoskedasticity)
graph 4 - there are still some outliers, but no highly leveraged points


### Model Summary

```{r}
# model results
summary(model_5)
```
__Model Summary:__

- All explanatory variables are statistically significant

- R-squared: model explains 52.5% of variance in price
  - model suffices as an ok explanation
  
- Room_type had the greatest influence on the model
  - interpret the coefficient for room_type- entire/apt:
    - An increase in entire/apt by one unit (zero to one) is associated with a 
change in price by (e^0.73-1) * 100 = __107.5%__, holding all other factors 
constant


### Variable relative importance 

```{r message=FALSE}

# function to calculate variable importance
calc.relimp(model_5, type = "lmg", rela = TRUE)
```

__Variables of relative importance:__ 

Room type, perhaps unsurprisingly, is the most relevant variable for explaining
variations in price. 
The least explanatory is availability_365


# Conclusion and Recommendations

__For Consumers:__ 

- The room type advertised is very important to determining the price you pay
- location analysis suggests city centre areas demand much higher prices than 
outskirt areas such as Staten Island

__For hosts:__ 
- How you describe your property doesn't translate to being able to charge
higher prices
- will be able to charge a much higher price for property listed as whole apartment
- property availability does not have much impact on price


# Future analysis

More data on:

-yearly data
  - time-series analysis
    - identify cyclical trends